Reuters Dataset

A Multi-class Classification Example

Introduction

The Reuters Dataset contains a set of short newswires and their topics, published by Reuters in 1986. Each topic has at least 10 examples in the training set. Our job is to classify Reuters newswires into 46 different mutually exclusive topics. There are 8982 training examples and 2246 testing examples.

Code
library(keras)

reuters = dataset_reuters(num_words = 10000) 
# 10,000 most frequently occurring words

c(c(train_data, train_labels), c(test_data,test_labels)) %<-% reuters

train_labels[[1]] 
# label associated with an example is an integer between 0 and 45 - a topic index
test_labels[[1]] 
# 1st test label is a topic index of 3 - we will predict this later

Data Preparation

Code
vectorize_sequences <- function(sequences, dimension = 10000) {
  results <- matrix(0, nrow = length(sequences), ncol = dimension)
  for (i in 1:length(sequences)){
    results[i, sequences[[i]]] <- 1 }
  results
}
x_train <- vectorize_sequences(train_data)
x_test <- vectorize_sequences(test_data)

To vectorize labels, there are two possibilities : (1) cast the label list as an integer tensor (2) One-hot encoding.

One-hot encoding is a widely used format for categorical data, also called categorical encoding. Here, one-hot encoding of the labels consists of embedding each label as an all-zero vector with a 1 in the place of the label index.

Code
one_hot_train_labels = to_categorical(train_labels)
one_hot_test_labels = to_categorical(test_labels)

Network Structure

Here we are dealing with output which has 46 classes i.e. the dimensionality of output space is much larger. Earlier there were 2 classes.

In a stack of dense layers, each layer can only access information present in the output of the previous layer. If one layer drops some information relevant to the classification problem, this information can never be recovered by later layers: each layer can potentially become an information bottleneck. In the previous e.g. we used 16-dimensional intermediate layers, but a 16-dimensional space may be too limited to learn to separate 46 different classes; such small layers may act as information bottlenecks, permanently dropping relevant information.

Code
# solution -> use 64 units
model <- keras_model_sequential() %>%
  layer_dense(units = 64, activation = "relu", input_shape = c(10000)) %>%
  layer_dense(units = 64, activation = "relu") %>%
  layer_dense(units = 46, activation = "softmax")

summary(model)

We end the network with a dense layer of size 46 because we have to classify into 46 different classes. The use of softmax activation means that the network will output a probability distribution over the 46 different output classes.

Compile

Code
model %>% compile(optimizer="rmsprop",
                  loss="categorical_crossentropy",
                  metrics=c("accuracy"))

Creating Validation

Code
val_indices = 1:1000  # 1000 samples

x_val = x_train[val_indices, ]
y_val = one_hot_train_labels[val_indices,]

partial_x_train = x_train[-val_indices,]
partial_y_train = one_hot_train_labels[-val_indices,]

Training

Code
history = model %>% fit(partial_x_train, partial_y_train, 
                        epochs=20,batch_size=512,
                        validation_data=list(x_val, y_val))

Figure : Training Logs

The network begins to overfit after 9 epochs !

Code
# change epochs to 9
history = model %>% fit(partial_x_train, partial_y_train, 
                        epochs=9, batch_size=512,
                        validation_data = list(x_val, y_val))
# training accuracy = 96.64% # validation accuracy = 79%
results = model %>% evaluate(x_test, one_hot_test_labels) 
# test accuracy = 78.54%

Accuracy around 78-79%. In case of a balanced binary classification problem, the accuracy reached by a purely random classifier would be 50%

Code
# Purely Random Classifier
test_labels_copy = test_labels
test_labels_copy = sample(test_labels_copy)
length(which(test_labels==test_labels_copy))/length(test_labels) # 18%

But in this case it’s closer to 18%, so the results seem pretty good, at least when compared to a random-baseline.

Predictions

Code
predictions = model %>% predict(x_test)

dim(predictions) # 2246 46
# returns a probability distribution over all 46 topics thus vector of 
# length 46 for all 2246 test examples

which.max(predictions[1,]) # 4
# class with highest probability is '4' - as we saw above, 
# the actual topic index is 3

predictions[1,4]
# topic index 4 is predicted with a probability of 98.499%

predictions[1,3]
# topic index 3 is predicted with a probability of 0.0001%

Model with Information Bottleneck

Code
model = keras_model_sequential() %>%
  layer_dense(units = 64, activation = "relu", 
              input_shape = c(10000)) %>%
  layer_dense(units = 4, activation = "relu") %>%
  layer_dense(units = 46, activation = "softmax")

model %>% compile(optimizer="rmsprop",loss="categorical_crossentropy",
                  metrics=c("accuracy"))

model %>% fit(partial_x_train, partial_y_train, epochs=20, batch_size=128,
              validation_data = list(x_val, y_val))

Figure : Training Logs

The network peaks at 71% validation accuracy i.e. 8% absolute drop - mostly due to the fact that we’re trying to compress a lot of information into an intermediate space that is too low-dimensional. The network is able to cram most of the necessary information into these 8-dimensional representations, but not all of it.

Structural Experimenting

Code
model = keras_model_sequential() %>%
  layer_dense(units = 64, activation = "relu", 
              input_shape = c(10000)) %>%
  layer_dense(units = 64, activation = "relu") %>%
  layer_dense(units = 64, activation = "relu") %>%
  layer_dense(units = 46, activation = "softmax")

model %>% compile(optimizer="rmsprop",
                  loss="categorical_crossentropy",
                  metrics=c("accuracy"))

history = model %>% fit(partial_x_train, partial_y_train, 
                        epochs=9, batch_size=512,
                        validation_data = list(x_val, y_val))

results = model %>% evaluate(x_test, one_hot_test_labels)
Number of Layers Hidden Nodes Accuracy
2 32, 32 76.58% (slightly lower)
2 64, 32 77.69% (close)
3 128, 128, 64 77.52% (lower)
3 64, 64, 64 77.03% (lower)